The Volvo Service Desk Dataset used in this project was first released for the BPI Challenge 2013:
Ward Steeman (2013): BPI Challenge 2013, incidents. 4TU.ResearchData. Dataset. https://doi.org/10.4121/uuid:500573e6-accc-4b0c-9576-aa5468b10cee
A lot of event data is recorded by many different systems. In order to "mine" (extract, preprocess and analyze) these processes, the data must be available as so-called event data or event logs. An event log must consist of three essential components:
- a case identifier (which process instance, e.g. which ticket, an event belongs to),
- an activity label (what happened),
- a timestamp (when it happened).
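As a toy illustration (invented data, not the BPI Challenge log), an event log with these three components can be sketched as a plain data frame:

```r
# Toy event log: every event carries a case identifier,
# an activity label and a timestamp
log <- data.frame(
  case_id   = c("c1", "c1", "c1", "c2", "c2"),
  activity  = c("Accepted", "Queued", "Completed", "Accepted", "Completed"),
  timestamp = as.POSIXct(c("2013-01-01 09:00", "2013-01-01 09:30",
                           "2013-01-01 11:00", "2013-01-02 08:00",
                           "2013-01-02 10:15"))
)
str(log)
```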
The process analysis workflow consists of three iterative steps:
Event data is extracted from one or more information systems and transformed to event logs.
Preprocessing is done by aggregation (removing redundant details), filtering (focusing the analysis on relevant subsets) and enrichment (adding useful data attributes, e.g. calculated values).
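The filtering and enrichment steps can be sketched in base R on a toy event log (invented data, illustrative only):

```r
# Toy event log
log <- data.frame(
  case_id   = c("c1", "c1", "c2", "c2", "c3"),
  activity  = c("Accepted", "Completed", "Accepted", "Completed", "Queued"),
  timestamp = as.POSIXct(c("2013-01-01 09:00", "2013-01-01 11:00",
                           "2013-01-02 08:00", "2013-01-02 10:00",
                           "2013-01-05 12:00"))
)

# Filtering: keep only cases that reach "Completed"
completed_cases <- unique(log$case_id[log$activity == "Completed"])
log_filtered <- log[log$case_id %in% completed_cases, ]

# Enrichment: add a calculated attribute (weekday of the event)
log_filtered$weekday <- weekdays(log_filtered$timestamp)
```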
The data is analyzed from three perspectives:
- The organizational perspective focuses on the actors of the process (e.g. the roles of different doctors and nurses and how they work together).
- The control-flow perspective focuses on the flow and structuredness of the process (e.g. a patient's journey through the emergency department).
- The performance perspective focuses on time and efficiency (e.g. how long it takes until a patient can leave the emergency department).
Different perspectives can also be combined with multivariate analysis (e.g. are there links between actors and performance issues) as well as with the inclusion of additional data attributes (e.g. custom activities, costs).
First we load the XES log file to get started:
library(bupaR)    # core process analysis framework
library(xesreadR) # provides read_xes()
data <- read_xes("bpi_challenge_2013_incidents.xes")
Next we want to get an overview of the available information within the event log:
data %>% summary()
## Number of events: 65533
## Number of cases: 7554
## Number of traces: 1511
## Number of distinct activities: 4
## Average trace length: 8.675271
##
## Start eventlog: 2010-03-31 14:59:42
## End eventlog: 2012-05-22 23:22:25
## CASE_concept_name activity_id impact
## Length:65533 Accepted :40117 Length:65533
## Class :character Completed:13867 Class :character
## Mode :character Queued :11544 Mode :character
## Unmatched: 5
##
##
##
## lifecycle_id org_group resource_id
## In Progress :30239 Length:65533 Siebel : 6162
## Awaiting Assignment:11544 Class :character Krzysztof: 1173
## Resolved : 6115 Mode :character Pawel : 925
## Closed : 5716 Marcin : 688
## Wait - User : 4217 Marika : 605
## Assigned : 3221 Michael : 587
## (Other) : 4481 (Other) :55393
## org_role organization country organization involved
## Length:65533 Length:65533 Length:65533
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## product resource country timestamp
## Length:65533 Length:65533 Min. :2010-03-31 14:59:42
## Class :character Class :character 1st Qu.:2012-04-27 04:46:48
## Mode :character Mode :character Median :2012-05-02 14:07:37
## Mean :2012-04-25 07:41:31
## 3rd Qu.:2012-05-04 09:37:27
## Max. :2012-05-22 23:22:25
##
## activity_instance_id .order
## Length:65533 Min. : 1
## Class :character 1st Qu.:16384
## Mode :character Median :32767
## Mean :32767
## 3rd Qu.:49150
## Max. :65533
##
data %>% select(lifecycle_id) %>% group_by(lifecycle_id) %>% summarize()
## # A tibble: 13 x 1
## lifecycle_id
## <fct>
## 1 Assigned
## 2 Awaiting Assignment
## 3 Cancelled
## 4 Closed
## 5 In Call
## 6 In Progress
## 7 Resolved
## 8 Unmatched
## 9 Wait
## 10 Wait - Customer
## 11 Wait - Implementation
## 12 Wait - User
## 13 Wait - Vendor
The EDA shows that the log contains 65,533 events from 7,554 cases, with only four distinct activities but 13 different lifecycle states and 1,511 distinct traces.
Since the terminology in bupaR differs from the IEEE standard for XES files (bupaR follows current literature rather than the standard's terminology), we also take a look at the meta-information to check whether the XES file was mapped correctly to bupaR's terminology. As we can see, the mapping is correct:
data %>% mapping()
## Case identifier: CASE_concept_name
## Activity identifier: activity_id
## Resource identifier: resource_id
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle_id
Since activities describe the flow of the process (here: of a ticket), we take a closer look at them:
# Overall activity count
data %>% activities()
## # A tibble: 4 x 3
## activity_id absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 Accepted 40117 0.612
## 2 Completed 13867 0.212
## 3 Queued 11544 0.176
## 4 Unmatched 5 0.0000763
data %>% traces()
## # A tibble: 1,511 x 3
## trace absolute_frequen… relative_frequen…
## <chr> <int> <dbl>
## 1 Accepted,Accepted,Completed 1754 0.232
## 2 Accepted,Accepted,Completed,Completed 524 0.0694
## 3 Accepted,Accepted,Queued,Accepted,Comple… 352 0.0466
## 4 Accepted,Accepted,Queued,Accepted,Accept… 334 0.0442
## 5 Queued,Accepted,Completed,Completed 300 0.0397
## 6 Accepted,Accepted,Accepted,Completed,Com… 230 0.0304
## 7 Accepted,Accepted,Queued,Accepted,Accept… 185 0.0245
## 8 Accepted,Accepted,Queued,Accepted,Queued… 161 0.0213
## 9 Accepted,Accepted,Accepted,Accepted,Comp… 149 0.0197
## 10 Accepted,Accepted,Queued,Accepted,Accept… 122 0.0162
## # … with 1,501 more rows
data %>% trace_explorer(coverage = 0.6)
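Under the hood, a trace is simply the ordered sequence of activities within a case, and the counting that traces() performs can be sketched in base R (toy data, illustrative only):

```r
# Toy log; in the real data each case is a ticket
log <- data.frame(
  case_id  = c("c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3"),
  activity = c("Accepted", "Queued", "Completed",
               "Accepted", "Completed",
               "Accepted", "Queued", "Completed")
)
# Collapse each case into its activity sequence (its trace),
# assuming the rows are already in chronological order
traces <- tapply(log$activity, log$case_id,
                 function(a) paste(a, collapse = ","))
# Count how often each distinct trace occurs
sort(table(traces), decreasing = TRUE)
# "Accepted,Queued,Completed" occurs twice, "Accepted,Completed" once
```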
The EDA shows that "Accepted" accounts for about 61% of all events, that 1,511 distinct traces occur among only 7,554 cases, and that even the most frequent trace covers only about 23% of the cases — the process is far less standardized than four activities might suggest.
Processes always depend on resources or actors getting things done. Even in very structured and standardized processes, the habits and decisions of staff members have an impact on the efficiency and effectiveness of the process. Therefore we investigate:
resources(data)
## # A tibble: 1,440 x 3
## resource_id absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 Siebel 6162 0.0940
## 2 Krzysztof 1173 0.0179
## 3 Pawel 925 0.0141
## 4 Marcin 688 0.0105
## 5 Marika 605 0.00923
## 6 Michael 587 0.00896
## 7 Fredrik 585 0.00893
## 8 Piotr 554 0.00845
## 9 Andreas 542 0.00827
## 10 Brecht 477 0.00728
## # … with 1,430 more rows
# User Organization: the business area of the user reporting the problem to the helpdesk
data %>% group_by(`organization involved`) %>% summarize(counts=n())
## # A tibble: 1 x 1
## counts
## <int>
## 1 65533
# Function division: The IT organization is divided into functions (mostly technology wise)
data %>% group_by(org_role) %>% summarize(counts=n())
## # A tibble: 24 x 2
## org_role counts
## <chr> <int>
## 1 A2_1 9977
## 2 A2_2 2618
## 3 A2_3 1136
## 4 A2_4 1691
## 5 A2_5 618
## 6 C_1 36
## 7 C_3 2
## 8 C_5 7
## 9 C_6 219
## 10 D_1 1488
## # … with 14 more rows
# ST (support team): the actual team that will try to solve the problem
data %>% group_by(org_group) %>% summarize(counts=n())
## # A tibble: 649 x 2
## org_group counts
## <chr> <int>
## 1 A1 1
## 2 A10 146
## 3 A11 10
## 4 A12 2
## 5 A13 3
## 6 A14 106
## 7 A15 2
## 8 A16 2
## 9 A17 4
## 10 A18 35
## # … with 639 more rows
# Ticket owner (responsible for ticket during its lifecycle), works in a support team
data %>% group_by(resource_id) %>% summarize(counts=n())
## # A tibble: 1,440 x 2
## resource_id counts
## <fct> <int>
## 1 - 30
## 2 Aaron 37
## 3 Abby 83
## 4 Abdelkader 1
## 5 Abdul 83
## 6 Abhijit 2
## 7 Abhimanyu 2
## 8 Abhinav 26
## 9 Abhiseka 77
## 10 Abhishek 6
## # … with 1,430 more rows
# Products serviced
data %>% group_by(product) %>% summarize(counts=n())
## # A tibble: 704 x 2
## product counts
## <chr> <int>
## 1 - - 6
## 2 OTHER 6
## 3 OTHERS 49
## 4 PROD1 15
## 5 PROD102 8
## 6 PROD103 10
## 7 PROD104 317
## 8 PROD105 4
## 9 PROD106 12
## 10 PROD107 58
## # … with 694 more rows
# Incident impact classes
data %>% group_by(impact) %>% summarize(counts=n())
## # A tibble: 4 x 2
## impact counts
## <chr> <int>
## 1 High 2707
## 2 Low 27877
## 3 Major 44
## 4 Medium 34905
# Country of Ticket Owner
data %>% group_by(`resource country`) %>% summarize(counts=n())
## # A tibble: 1 x 1
## counts
## <int>
## 1 65533
# Country of support team and/or function division
data %>% group_by(`organization country`) %>% summarize(counts=n())
## # A tibble: 1 x 1
## counts
## <int>
## 1 65533
# level options: "log", "case", "trace", "activity", "resource", "resource-activity"
data %>% resource_frequency(level = "resource-activity") %>% plot()
data %>% resource_frequency(level = "resource") %>% plot()
data %>% resource_frequency(level = "activity") %>% plot()
# level options: "case", "resource", "resource-activity"
data %>% resource_involvement(level = "resource") %>% plot()
The EDA shows that work is spread over 1,440 resources, with "Siebel" (about 9.4% of all events) standing out by far, and that the attributes `organization involved`, `resource country` and `organization country` each collapse into a single group, so they carry no usable information in this log.
#data %>% resource_map("resource")
The control flow refers to the different successions of activities: each case can be seen as a sequence of activities, and each unique sequence is called a trace or process variant. The process can be analyzed in different ways by:
Metrics (for specific aspects of the process)
# Start and end activities (entry & exit points)
data %>% start_activities("activity") %>% plot()
data %>% end_activities("activity") %>% plot()
# Activity presence shows in what percentage of cases an activity is present
data %>% activity_presence()
## # A tibble: 4 x 3
## activity_id absolute relative
## <fct> <int> <dbl>
## 1 Accepted 7552 1.00
## 2 Completed 7546 0.999
## 3 Queued 4511 0.597
## 4 Unmatched 5 0.000662
data %>% activity_presence() %>% plot()
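activity_presence() answers the question "in what share of cases does this activity occur at least once?". The idea behind it can be sketched in base R (toy data, illustrative only):

```r
# Toy event log
log <- data.frame(
  case_id  = c("c1", "c1", "c2", "c2", "c3"),
  activity = c("Accepted", "Completed", "Accepted", "Queued", "Accepted")
)
n_cases <- length(unique(log$case_id))
# For each activity: in how many distinct cases does it appear?
presence_abs <- tapply(log$case_id, log$activity,
                       function(ids) length(unique(ids)))
presence_rel <- presence_abs / n_cases
presence_rel
# "Accepted" appears in all 3 cases (1.0), "Queued" in 1 of 3 (~0.33)
```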
# level options: "log", "case", "activity", "resource", "resource-activity"
# Min, max and average number of repetitions
data %>% number_of_repetitions(level = "log")
## min q1 median mean q3 max st_dev iqr
## 0.0000000 0.0000000 1.0000000 0.9310299 2.0000000 3.0000000 0.9648301 2.0000000
## attr(,"type")
## [1] "all"
# Number of repetitions per resource
data %>% number_of_repetitions(level = "resource") %>% plot()
# Number of repetitions per activity
data %>% number_of_repetitions(level = "activity") %>% plot()
Visuals
# A normal process map
data %>% process_map(type = frequency())
# Shows the most frequent traces covering e.g. 60% of the event log
data %>% trace_explorer(type = "frequent", coverage = 0.6)
# Shows the most infrequent traces covering e.g. 10% of the event log
data %>% trace_explorer(type = "infrequent", coverage = 0.1)
# Options: "absolute" or "relative" frequency,
# "relative_antecedent" frequency, e.g. A is x% of time followed by B.
# "relative_consequent" frequency, e.g. C is x% of time preceded by D.
data %>% precedence_matrix(type = "absolute") %>% plot()
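A precedence matrix counts directly-follows pairs: how often an activity A is immediately followed by an activity B within the same case. A base-R sketch on toy data (illustrative only, assuming rows are in chronological order):

```r
# Toy event log
log <- data.frame(
  case_id  = c("c1", "c1", "c1", "c2", "c2"),
  activity = c("Accepted", "Queued", "Completed", "Accepted", "Completed")
)
# Build (antecedent, consequent) pairs within each case
pairs <- do.call(rbind, by(log, log$case_id, function(d) {
  n <- nrow(d)
  if (n < 2) return(NULL)
  data.frame(antecedent = d$activity[-n], consequent = d$activity[-1])
}))
# Absolute precedence matrix: rows = antecedent, columns = consequent
table(pairs$antecedent, pairs$consequent)
```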
We will now concentrate on the time perspective (in general). The process can be analyzed in different ways by:
Visuals
# A performance process map (shows durations)
data %>% process_map(type = performance())
# FUN options: median, min, max, mean; units options: "hours", "days"
data %>% process_map(type = performance(FUN = median, units = "hours"))
# The dotted chart shows the distribution of activities over time (x-axis: time, y-axis: cases)
data %>% dotted_chart(x = "absolute", sort = "start", units = "hours")
Metrics (for specific aspects of the process)
# throughput_time (includes active time + idle time)
data %>% throughput_time(level = "log", units = "hours") %>% plot()
# processing_time (sum of the activity durations, excludes time between activities)
data %>% processing_time(level = "activity") %>% plot()
# idle_time (sum of the durations between activities)
data %>% idle_time("log", units = "days") %>% plot()
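The throughput time of a case is simply the span from its first to its last event. The calculation can be sketched in base R (toy data, illustrative only):

```r
# Toy event log
log <- data.frame(
  case_id   = c("c1", "c1", "c2", "c2"),
  activity  = c("Accepted", "Completed", "Accepted", "Completed"),
  timestamp = as.POSIXct(c("2013-01-01 09:00", "2013-01-01 13:00",
                           "2013-01-02 08:00", "2013-01-02 09:30"))
)
# Throughput time per case: last event timestamp minus first
throughput <- tapply(log$timestamp, log$case_id,
                     function(ts) as.numeric(difftime(max(ts), min(ts),
                                                      units = "hours")))
throughput
# c1: 4 hours, c2: 1.5 hours
```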
The first way of linking perspectives is by making use of the granularity levels of the metrics:
- By calculating the processing time at the level of resources, we can link the organizational and performance perspectives: processing_time(level = "resource")
- By analyzing rework per resource, we can link the control-flow and organizational perspectives: number_of_repetitions(level = "resource")
Alternatively, we might also want to include additional data attributes in the analysis. This can be done by grouping the event log. Incorporating categorical data attributes into the calculation of a process metric is done with the group_by function, just as when working with regular data in the tidyverse. Grouping on a variable implicitly splits the event log according to the different values of that variable. Any process metric calculated for a grouped event log is then calculated for each group individually. The results for each of the groups are combined into a single output, which can also be visualized using the plot function.
This workflow allows us to easily compare different groups of cases. Combining all these ingredients (data attributes, metrics, levels, plots) allows for a very flexible toolset to perform process analysis. Using the piping symbol, each of the different tools can be easily combined, e.g.:
data %>% group_by(impact) %>% number_of_repetitions(level = "resource") %>% plot()
data %>% number_of_repetitions(level = "activity") %>% arrange(activity_id)
## # Description: activity_metric[,3] [4 × 3]
## activity_id absolute relative
## <fct> <dbl> <dbl>
## 1 Accepted 4074 0.102
## 2 Completed 514 0.0371
## 3 Queued 2445 0.212
## 4 Unmatched 0 0
Because of this flexibility, we can now answer almost any process-related research question you can think of.